
Feat: Add api to get machines with leaks#570

Open
srinivasadmurthy wants to merge 72 commits into NVIDIA:main from srinivasadmurthy:sdmrlav2

Conversation

@srinivasadmurthy
Contributor

Description

Type of Change

- [x] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Breaking Changes

- [ ] This PR contains breaking changes

Testing

- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [x] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

Additional Notes

Tested by enabling the debug features cpu2temp_alert and leak_alert in crates/health/Cargo.toml.
Enabling these generates the relevant health overrides; grpcurl was then used to test the GetHardwareLeaksReport API.

@srinivasadmurthy srinivasadmurthy requested a review from a team as a code owner March 16, 2026 05:46
@copy-pr-bot

copy-pr-bot bot commented Mar 16, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Contributor

@Matthias247 Matthias247 left a comment


I don't know about the exact use-case for this.

But I'd prefer not to add dedicated APIs for searching for specific alert types; instead I'd rather extend the search filter passed to FindMachineIds to support searching by health probe IDs. That would be more universal and would require no new API.

@srinivasadmurthy
Contributor Author

@Matthias247 @kensimon Thanks for your review feedback. I have implemented the suggested changes and am requesting a re-review.

@kensimon
Contributor

I'm going to quote this comment from @srinivasadmurthy to get a discussion going:

This API is for use by RLA. The health monitor in carbide is scraping BMC sensors and detecting compute tray leaks. Once a leak is detected, it places a health override with the Leaks classification. RLA needs to query Carbide for leaking machines periodically, and then act on that. The returned data includes the leaking machine IDs and their current power state. For each machine with a leak, RLA will issue two calls: UpdatePowerOptions to set the desired machine state to OFF, and then AdminPowerControl to switch off the machine. Since this is supposed to respond to leaks reported by the health monitor, it's not a general-purpose search routine. Since responding to leaks needs to be fast, it's better to have a single API call that gives RLA all the information it needs, rather than getting machine IDs first with a filter and then calling GetPowerOptions.
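For illustration, the two-call response flow described in this quote could be sketched roughly as follows. All type and function names here (`LeakReportEntry`, `RlaCall`, `plan_leak_response`) are hypothetical stand-ins, not Carbide's actual API surface:

```rust
// Hypothetical sketch of the RLA-side response to a leak report.

#[derive(Debug, Clone, PartialEq)]
enum PowerState {
    On,
    Off,
}

#[derive(Debug, Clone)]
struct LeakReportEntry {
    machine_id: String,
    power_state: PowerState,
}

#[derive(Debug, PartialEq)]
enum RlaCall {
    UpdatePowerOptions { machine_id: String, desired: PowerState },
    AdminPowerControl { machine_id: String, action: PowerState },
}

/// For each leaking machine that is still powered on, plan the two calls
/// described above: set the desired state to OFF, then power the machine off.
fn plan_leak_response(report: &[LeakReportEntry]) -> Vec<RlaCall> {
    report
        .iter()
        .filter(|e| e.power_state == PowerState::On)
        .flat_map(|e| {
            vec![
                RlaCall::UpdatePowerOptions {
                    machine_id: e.machine_id.clone(),
                    desired: PowerState::Off,
                },
                RlaCall::AdminPowerControl {
                    machine_id: e.machine_id.clone(),
                    action: PowerState::Off,
                },
            ]
        })
        .collect()
}

fn main() {
    let report = vec![
        LeakReportEntry { machine_id: "m-1".into(), power_state: PowerState::On },
        LeakReportEntry { machine_id: "m-2".into(), power_state: PowerState::Off },
    ];
    // m-2 is already off, so only m-1 produces the two calls.
    let calls = plan_leak_response(&report);
    assert_eq!(calls.len(), 2);
    println!("{calls:?}");
}
```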

I really think if the goal here is to respond to leak alerts and shut machines off, having two different layers of polling (having to wait for the health monitor to scrape sensors from a very unreliable BMC API, then having to wait for RLA to pick up the results from the health monitor) is likely not going to be fast enough. You'd have to have an unreasonably fast polling interval to catch the alert in time to do something about it, and the cost of that is likely too much in a larger datacenter with lots of machines.

It seems like it'd be better for health events to stream directly to RLA, so that the instant a health override is added to carbide, it's also forwarded to RLA which can act on it directly, bypassing the polling altogether. Is this something we've thought about?

@zhaozhongn

> (quoting @kensimon's comment above in full)

Yes, that's the long-term intention. In the short term, people were not sure what the health streaming/push mechanism should be, hence we opted for this query model for now. It will still be very useful for other non-handling purposes (e.g., we will check whether any tray in a rack has a leak before turning on the host on a tray). But yes, the handling scenario will switch to a faster method if needed.

srinivasadmurthy and others added 13 commits March 20, 2026 21:17
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description
Drive by to ensure crates/systemd/src/systemd.rs is parsable on
non-linux systems (macOS...).
This is a no-op for linux systems.

## Type of Change
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [X] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Breaking Changes: NO.

Signed-off-by: Patrice Breton <pbreton@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
… even though there are alerts. (NVIDIA#515)

## Description

Fixes missing `host_machine_id` label in DPU logs and
`telemetry_stats_log_records_count` metric by fetching the id through
carbide API in forge-dpu-agent using the FindInterfaces request. The
label is needed for `SuppressExternalAlerting` to work with the
`noDpuLogsWarning` alert.

The request is retried if the id isn't immediately available, using the
`backon` crate to increase the retry interval to a maximum of every 5
minutes. Adds support for pending file contents to the `duppet` crate.
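The retry behavior described above (an increasing interval capped at 5 minutes, as provided by the `backon` crate) can be sketched as a pure delay schedule. This is an illustrative stand-in for what the backoff builder computes, not the PR's actual code:

```rust
use std::time::Duration;

/// Delay before the nth retry (0-based): base * 2^n, capped at `max`.
/// Mirrors an exponential backoff with a 5-minute ceiling.
fn backoff_delay(base: Duration, max: Duration, attempt: u32) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.saturating_mul(factor).min(max)
}

fn main() {
    let base = Duration::from_secs(1);
    let max = Duration::from_secs(300); // 5-minute cap, per the PR description

    assert_eq!(backoff_delay(base, max, 0), Duration::from_secs(1));
    assert_eq!(backoff_delay(base, max, 3), Duration::from_secs(8));
    // Large attempt counts saturate at the cap instead of overflowing.
    assert_eq!(backoff_delay(base, max, 20), Duration::from_secs(300));
    println!("schedule ok");
}
```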

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [x] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)

https://nvbugspro.nvidia.com/bug/5668278

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [x] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

Manual testing in local dev to verify
- retry on failure
- `/run/otelcol-contrib/host-machine-id` is created/updated/unchanged as
expected.

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

---------

Signed-off-by: Tom Erickson <terickson@NVIDIA.COM>
Co-authored-by: Ken Simon <ken@kensimon.io>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description

This PR adds NVUE telemetry collection for NVLink Switches to the health
service in a new collector:
- NvueRest (HTTP polling)

It is disabled by default and configurable (polling interval, request
timeouts, and enablement of telemetry per path).

Path enablement:
- system_health_enabled: Poll
[/nvue_v1/system/health](https://docs.nvidia.com/networking-ethernet-software/nvos-api-25024300/#/system/getSystemHealth)
- cluster_apps_enabled: Poll
[/nvue_v1/cluster/apps](https://docs.nvidia.com/networking-ethernet-software/nvos-api-25024300/#/cluster/getClusterApps)
- sdn_partitions_enabled: Poll
[/nvue_v1/sdn/partition](https://docs.nvidia.com/networking-ethernet-software/nvos-api-25024300/#/partition/getSdnPartitions)
- interfaces_enabled: Poll
[/nvue_v1/interface](https://docs.nvidia.com/networking-ethernet-software/nvos-api-25024300/#/interface/getInterface)
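The per-path enablement described above could look roughly like the following. The field names mirror the PR text, but the struct itself and its defaults are an illustrative sketch, not the actual config type:

```rust
use std::time::Duration;

// Hypothetical config for the NvueRest collector; disabled by default,
// with each polled path individually switchable.
#[derive(Debug, Clone)]
struct NvueRestConfig {
    enabled: bool,
    polling_interval: Duration,
    request_timeout: Duration,
    system_health_enabled: bool,  // /nvue_v1/system/health
    cluster_apps_enabled: bool,   // /nvue_v1/cluster/apps
    sdn_partitions_enabled: bool, // /nvue_v1/sdn/partition
    interfaces_enabled: bool,     // /nvue_v1/interface
}

impl Default for NvueRestConfig {
    fn default() -> Self {
        Self {
            enabled: false, // collector is off by default
            polling_interval: Duration::from_secs(60),
            request_timeout: Duration::from_secs(10),
            system_health_enabled: false,
            cluster_apps_enabled: false,
            sdn_partitions_enabled: false,
            interfaces_enabled: false,
        }
    }
}

/// Paths the collector would poll given this config.
fn enabled_paths(cfg: &NvueRestConfig) -> Vec<&'static str> {
    if !cfg.enabled {
        return Vec::new();
    }
    let mut paths = Vec::new();
    if cfg.system_health_enabled { paths.push("/nvue_v1/system/health"); }
    if cfg.cluster_apps_enabled { paths.push("/nvue_v1/cluster/apps"); }
    if cfg.sdn_partitions_enabled { paths.push("/nvue_v1/sdn/partition"); }
    if cfg.interfaces_enabled { paths.push("/nvue_v1/interface"); }
    paths
}

fn main() {
    let cfg = NvueRestConfig {
        enabled: true,
        system_health_enabled: true,
        ..Default::default()
    };
    assert_eq!(enabled_paths(&cfg), vec!["/nvue_v1/system/health"]);
    // Disabled collector polls nothing, regardless of path flags.
    assert!(enabled_paths(&NvueRestConfig::default()).is_empty());
}
```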

## Type of Change
<!-- Check one that best describes this PR -->
- [x] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Testing
<!-- How was this tested? Check all that apply -->
- [x] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
~~Manual testing needed on rack (and soon to come).~~
Tested and working with `nvue_v1` running on NVOS 25

---------

Signed-off-by: Ivan Anisimov <ianisimov@nvidia.com>
Co-authored-by: Ivan Anisimov <ianisimov@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description
<!-- Describe what this PR does -->
part two of dpf sdk refactor. this moves internal state handling largely
to the dpf operator and lets the sdk trigger events when the state
handler loop should act on dpu state changes. still using a custom bfb
config and preloaded systemd services. adds dts as the first crd managed
dpu service. holding back on using the other dpu services (present in
other branch) until we can decide how those should be configured. the
reasoning is not to break existing functionality from milestone 1 as
those dpu services are untested. when all services are moved over, we
should be able to remove the (backported) tera config.

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [x] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

Closes FORGE-7959

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [x] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

---------

Signed-off-by: fspitulski <fspitulski@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
…ot running (NVIDIA#559)

adds "update" to the stop command. This makes sure that supervisord
knows about the dhcp server config before the stop is issued. this is
important on newly provisioned DPUs since the dhcp server config doesn't
exist initially.

Also stops fetching the timestamps when dhcp is not running, to avoid
noise in the logs.

## Description
<!-- Describe what this PR does -->

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [X] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [X] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
…n gRPC APIs and a new DB table. (NVIDIA#490)

## Description

SPIFFE JWT-SVID machine identity support: add per-tenant identity
configuration and token delegation via gRPC APIs and a new DB table.

---

## 1. Database Layer (`api-db`)

### New: `crates/api-db/src/tenant_identity_config.rs`

- `TenantIdentityConfig` struct for per-org identity config
- `set()` – upsert identity config (issuer, audiences, TTL, signing key)
- `find()` – fetch config by org
- `delete()` – remove config
- `set_token_delegation()` – set token exchange config (endpoint, auth
method, client secret)
- `delete_token_delegation()` – clear delegation config
- Placeholder key generation (no real encryption yet)

### New:
`crates/api-db/migrations/20260225120000_tenant_identity_config.sql`

- `tenant_identity_config` table with:
  - **Identity:** `issuer`, `default_audience`, `allowed_audiences`, `token_ttl`, `subject_domain_prefix`, `enabled`
  - **Signing:** `encrypted_signing_key`, `signing_key_public`, `key_id`, `algorithm`, `master_key_id`
  - **Timestamps:** `created_at`, `updated_at`
  - **Delegation:** `token_endpoint`, `auth_method`, `encrypted_auth_method_config`, `subject_token_audience`, `token_delegation_created_at`
- FK to `tenants(organization_id)` with `ON DELETE CASCADE`

---

## 2. gRPC API (`rpc`)

### New Proto Messages

- `GetIdentityConfiguration` / `SetIdentityConfiguration` /
`DeleteIdentityConfiguration`
- `GetTokenDelegation` / `SetTokenDelegation` / `DeleteTokenDelegation`
- Messages: `GetIdentityConfigRequest`, `IdentityConfigRequest`,
`IdentityConfigResponse`, `TokenDelegationRequest`,
`TokenDelegationResponse`, `GetTokenDelegationRequest`

---

## 3. API Handlers (`handlers/identity_config.rs`)

### New: `crates/api/src/handlers/identity_config.rs` (657 lines)

- `get_identity_configuration` – read config by org
- `set_identity_configuration` – upsert config with org validation
- `delete_identity_configuration` – delete config
- `get_token_delegation` – read delegation config
- `set_token_delegation` – upsert delegation config
- `delete_token_delegation` – clear delegation

### Helper Functions

- `compute_secret_hash()` – SHA256 hash for secrets
- `truncate_hash_for_display()` – truncate hash for display
- `struct_to_json` / `json_to_struct` – protobuf ↔ JSON
- `build_response_auth_config()` – omit secrets from responses
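A minimal sketch of the display-truncation helper named above. The real `compute_secret_hash()` produces a SHA-256 digest; here we only illustrate the truncate-for-display behavior, and the 8-character prefix length is an assumption, not taken from the PR:

```rust
/// Keep a short hex prefix of a hash followed by an ellipsis, so API
/// responses can identify a secret without exposing the full digest.
/// (Sketch; the real helper's prefix length may differ.)
fn truncate_hash_for_display(hash: &str) -> String {
    const SHOWN: usize = 8; // assumed prefix length
    if hash.len() <= SHOWN {
        hash.to_string()
    } else {
        format!("{}...", &hash[..SHOWN])
    }
}

fn main() {
    let full = "3f786850e387550fdab836ed7e6dc881de23001b";
    assert_eq!(truncate_hash_for_display(full), "3f786850...");
    // Short inputs pass through unchanged.
    assert_eq!(truncate_hash_for_display("abcd"), "abcd");
}
```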

### Unit Tests (10)

- `compute_secret_hash`, `truncate_hash_for_display`
- `struct_to_json`, `json_to_struct`, `json_to_struct_roundtrip`
- `build_response_auth_config_omits_client_secret`, `truncates_hash`,
`passes_through_non_secret`, `non_object_returns_clone`

---

## 4. Configuration (`cfg/file.rs`)

### New: `MachineIdentityConfig`

- `enabled`, `algorithm`, `token_ttl_min`, `token_ttl_max`,
`token_endpoint_http_proxy`
- New `[machine_identity]` section in `CarbideConfig`

---

## 5. Integration Test Support (`api-test-helper`)

### New: `crates/api-test-helper/src/identity_config.rs`

- `set_identity_configuration()`, `get_identity_configuration()`,
`delete_identity_configuration()`
- `set_token_delegation()`, `get_token_delegation()`,
`delete_token_delegation()`
- Uses grpcurl for gRPC calls

---

## 6. Integration Tests (`api-integration-tests`)

### `run_identity_config_tests()` in `tests/lib.rs`

- Runs after tenant creation in `test_integration`
- Sets config → get → delete
- Sets config again → sets token delegation → get delegation → delete
delegation

---

## 7. Fixes

### `api_fixtures/mod.rs`

- Added `machine_identity: MachineIdentityConfig::default()` to
`get_config()` in `CarbideConfig`

---

## 8. Documentation

### `book/src/design/machine-identity/spiffe-svid-sdd.md`

- SDD for SPIFFE JWT-SVID machine identity
- Architecture, config flows, token delegation

---

## Files Changed (identity_config-related)

| File | Change |
|------|--------|
| `crates/api-db/src/tenant_identity_config.rs` | New |
| `crates/api-db/migrations/20260225120000_tenant_identity_config.sql` | New |
| `crates/api/src/handlers/identity_config.rs` | New |
| `crates/api/src/handlers/mod.rs` | Register handler |
| `crates/api/src/handlers/machine_identity.rs` | Modified |
| `crates/api/src/api.rs` | Route new RPCs |
| `crates/api/src/cfg/file.rs` | Add `MachineIdentityConfig` |
| `crates/api/src/tests/common/api_fixtures/mod.rs` | Add `machine_identity` |
| `crates/api-test-helper/src/identity_config.rs` | New |
| `crates/api-test-helper/src/lib.rs` | Export `identity_config` |
| `crates/api-integration-tests/tests/lib.rs` | Add `run_identity_config_tests()` |
| `crates/rpc/proto/forge.proto` | New RPCs and messages |
| `book/src/design/machine-identity/spiffe-svid-sdd.md` | Updated |

## Type of Change
<!-- Check one that best describes this PR -->
- [x] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
NVIDIA#447

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [x] Unit tests added/updated
- [x] Integration tests added/updated
- [x] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
This PR is part of a larger feature implementation related to
NVIDIA#261.

<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description
Some Lenovo platforms (HS350X V3) do not support SOL over SSH, so for
those we switch to IPMI if site-explorer reports `LenovoAMI` as the BMC
vendor. See:
NVIDIA#528

Note: These machines still report as `Lenovo` in the DMI data.

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [x] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Krish Dandiwala <kdandiwala@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
…NVIDIA#573)

## Description
This PR removes the lenovo-specific `boot_first(Pxe)` from the instance
`invoke_power()` call.

When neither the `boot_with_custom_ipxe` nor the
`run_provisioning_instructions_on_every_boot` flag is set, the user
expects a normal OS reboot, not a forced network boot. Boot-order
verification and correction for network-boot flows is now handled by the
state machine when required, so this lenovo override is unnecessary in
the default reboot path.

This covers the following scenarios:

1. Disk first: the customer OS boots immediately.
2. DPU/network first: the host attempts PXE, then exits/falls through to
the installed OS.
3. Custom install required, handled by the state machine:
a. If network boot is already first, the state handler proceeds with the
install flow.
b. If disk is first, the state handler corrects boot order (making
network/DPU first) and then proceeds with the install flow.

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [x] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
NVIDIA#530

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [x] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Krish Dandiwala <kdandiwala@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description
**Problem**: Machines can become unresponsive (e.g., left in BIOS menu,
OS issues, power faults) while they are in the `Ready` state, creating
silent failures that go undetected by carbide until instance creation
attempts fail. A metric was created for this issue but a host's health
state never reflected the timeout, so allocations weren't blocked.

**Fix**: Add scout heartbeat timeout health alert for machines in the
`Ready` state:

- Create a merge health override when `last_scout_contact_time` exceeds
`scout_reporting_timeout` (default 5 minutes)
- Remove the override automatically when scout heartbeat recovers in
Ready
- Clear the scout heartbeat timeout alert whenever the host transitions
out of Ready, so stale alerts do not leak across state changes.
- Continue emitting `hosts_with_scout_heartbeat_timeout` metric
- By default, the alert does not block allocations and suppresses
external alerting. To change this behavior, set the following in the
carbide config:
```
[host_health]
# Set to true to block allocations on hosts with scout heartbeat timeout (default: false)
prevent_allocations_on_scout_heartbeat_timeout = true
# Set to false to include these hosts in the unhealthy hosts Prometheus alert (default: true)
suppress_external_alerting_on_scout_heartbeat_timeout = false
```
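The override lifecycle above can be sketched as a pure check. The names and `MachineState` variants here are illustrative; only the 5-minute default comes from the PR description:

```rust
use std::time::Duration;

// Stand-in for the host's lifecycle state; only Ready matters here.
#[derive(Debug, PartialEq)]
enum MachineState {
    Ready,
    Other,
}

/// An override is warranted only for machines in Ready whose last scout
/// contact is older than the configured timeout. Recovering heartbeat or
/// leaving Ready both clear the alert.
fn needs_heartbeat_override(
    state: &MachineState,
    since_last_contact: Duration,
    timeout: Duration,
) -> bool {
    *state == MachineState::Ready && since_last_contact > timeout
}

fn main() {
    let timeout = Duration::from_secs(300); // default 5 minutes

    // Stale heartbeat in Ready: create the override.
    assert!(needs_heartbeat_override(&MachineState::Ready, Duration::from_secs(400), timeout));
    // Heartbeat recovered in Ready: override is removed.
    assert!(!needs_heartbeat_override(&MachineState::Ready, Duration::from_secs(30), timeout));
    // Transitioned out of Ready: alert cleared regardless of staleness.
    assert!(!needs_heartbeat_override(&MachineState::Other, Duration::from_secs(4000), timeout));
}
```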
## Type of Change
<!-- Check one that best describes this PR -->
- [x] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [x] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

---------

Signed-off-by: Krish Dandiwala <kdandiwala@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
…VIDIA#371)

## Description

This builds on the [firmware management
work](NVIDIA#323) (and
`ApplyFirmware`) to additionally implement `ApplyProfile` within the DPA
provisioning workflow (it has been stubbed out with placeholders).

The `ApplyProfile` state now handles `mlxconfig` profile management --
resetting the device's `mlxconfig` parameters to factory defaults
between tenancies, and then optionally applying a named
`MlxConfigProfile` if one is configured for the interface. This behavior
of reset + apply updated values is the recommended guidance from NBU.

High level changes include:
1. New `mlxconfig_profile` column on `dpa_interfaces` -- an optional
profile name that maps into `carbide-api`'s `mlxconfig_profiles` config
map.
2. Reworking the `OpCode::ApplyProfile` variant to carry an
`Option<SerializableProfile>` (mirroring how `ApplyFirmware` carries a
`FirmwareFlasherProfile`).
3. `carbide-api`-side config lookup + serialization in
`build_apply_profile_command`.
4. `scout`-side implementation in `mlx_device::apply_profile()`.
5. Corresponding State Controller updates to handle both the reset-only
and reset + profile sync workflows.

In this workflow:
1. We check the interface's `mlxconfig_profile` field.
2. If `None`, we send `ApplyProfile { serialized_profile: None }`, and
`scout` will reset to factory defaults (to prepare for the next tenant)
and report success.
3. If set, we look it up in the `runtime_config.mlxconfig_profiles` map,
serialize it via `SerializableProfile::from_profile()`, and send it down
to `scout`.
4. If the profile name is set, but can't be found in config, we return
an error rather than sending `None` (which would silently reset without
applying any intended profile(s)).
5. `scout` always resets mlxconfig to factory defaults first, then
applies the profile if one was provided, and reports back via
`MlxObservation`.
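The lookup rules in steps 1-4 above can be sketched as a pure function. The types here are simplified stand-ins for the PR's `Option<SerializableProfile>` plumbing and the `mlxconfig_profiles` config map:

```rust
use std::collections::HashMap;

// Simplified stand-in for the serialized mlxconfig profile.
#[derive(Debug, Clone, PartialEq)]
struct SerializableProfile {
    name: String,
}

/// Resolve the interface's `mlxconfig_profile` against the configured map:
/// `None` means reset-only; a configured name must resolve, otherwise we
/// error out rather than silently resetting without the intended profile.
fn resolve_apply_profile(
    interface_profile: Option<&str>,
    profiles: &HashMap<String, SerializableProfile>,
) -> Result<Option<SerializableProfile>, String> {
    match interface_profile {
        None => Ok(None), // factory reset only
        Some(name) => profiles
            .get(name)
            .cloned()
            .map(Some)
            .ok_or_else(|| format!("mlxconfig profile '{name}' not found in config")),
    }
}

fn main() {
    let mut profiles = HashMap::new();
    profiles.insert("hpc".to_string(), SerializableProfile { name: "hpc".into() });

    // No profile configured: reset-only.
    assert_eq!(resolve_apply_profile(None, &profiles), Ok(None));
    // Configured and found: send it down to scout.
    assert!(resolve_apply_profile(Some("hpc"), &profiles).unwrap().is_some());
    // Configured but missing: hard error, never a silent reset.
    assert!(resolve_apply_profile(Some("missing"), &profiles).is_err());
}
```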

The `ApplyProfile` state handler was also broken out into its own
`handle_apply_profile()` function, making it independently testable
without needing the full async state controller scaffolding. I need to
go back and do this in a few other pre-existing places.

Existing tests updated as needed, and new tests introduced.

Signed-off-by: Chet Nichols III <chetn@nvidia.com>

## Type of Change
<!-- Check one that best describes this PR -->
- [x] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [x] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [x] Unit tests added/updated
- [x] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description
Update to new name

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [X] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [X] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description
<!-- Describe what this PR does -->
delete dpf-sdk crate, rename dpf-sdk-beta crate to dpf-sdk, update
references to dpf-beta to dpf. fix license headers to apache 2.

(old) dpf-sdk crate is already unused. this is just cleanup.

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [x] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [x] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

---------

Signed-off-by: fspitulski <fspitulski@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
chet and others added 25 commits March 20, 2026 21:17
## Description

This continues the work from
NVIDIA#606,
NVIDIA#602,
NVIDIA#598,
NVIDIA#596,
NVIDIA#608, and
NVIDIA#610.

TLDR is we had leaked a few things from `::rpc` into the `api-db` layer,
which we generally don't want to do, and now that we have
`STYLE_GUIDE.md`, it was good to practice what we preach.

Everything else has been refactored. This handles the last bit of it,
and then kicks `carbide-rpc` out of the `carbide-api-db` crate entirely.

Signed-off-by: Chet Nichols III <chetn@nvidia.com>

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [x] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description

On numerous occasions, I have wanted to go into the admin UI to look at
DHCP lease information directly, but it's not there.

And I wanted to do it again today. And as usual, it's not there.

And I was like well, if it's not there, we should add it.

But what else should we add?

And then I thought, you know, we probably could use an IPAM section in
general, and have it be a place to look at:
- DHCP allocations (because we have `carbide-dhcp`).
- Authoritative DNS entries (because we have `carbide-dns`).
- Underlay networks/prefixes (because we manage those).
- Overlay networks/prefixes (and we manage those too).

Right now this is just taking care of the DHCP part, and I'm adding
placeholders for DNS and networks.

DHCP details include:
```
struct DhcpEntryDisplay {
    ip_address: String,
    mac_address: String,
    machine_id: String,
    hostname: String,
    created: String,
    last_dhcp: String,
    last_dhcp_rfc3339: String,
}
```

I hope people like this idea, because it's going to make me a lot
happier.

Signed-off-by: Chet Nichols III <chetn@nvidia.com>

## Type of Change
<!-- Check one that best describes this PR -->
- [x] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
… to the admin UI (NVIDIA#627)

## Description

Now that we're starting to turn up rack components (`Rack`,
`PowerShelf`, `Switch`) in Carbide, and since we have `ExpectedRack`,
`ExpectedPowerShelf`, and `ExpectedSwitch`, it makes sense to have these
available in the admin UI under the `Rack`, `Power Shelf`, and `Switch`
sections, which right now just have managed ones, and not the expected
details.

This adds all of that, including linked information, and the status of
explored/adopted components.

Signed-off-by: Chet Nichols III <chetn@nvidia.com>

## Type of Change
<!-- Check one that best describes this PR -->
- [x] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description

In another review, @Matthias247 pointed out that we could/should
probably have a pattern where `Args` can just `.into()` (or
`.try_into()?`) the underlying `Request` that they exist to populate.
Most cases of this are super straightforward, so I'm doing that to some
more of them (in addition to the ones I've already implemented).

At the end of the day, a "command" is now something like..

```rust
let req = args.try_into()?;
let resp = api_client.0.call(req).await?;
// ..do something w/resp
```

...and probably allows us to get even deeper into how we templatize
things in the admin CLI.
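
A minimal sketch of the conversion pattern, with hypothetical stand-in types (`ShowMachineArgs`, `ShowMachineRequest` are illustrative names, not the actual CLI or API types):

```rust
// Hypothetical CLI args type (stand-in for a clap-derived struct).
struct ShowMachineArgs {
    machine_id: String,
}

// Hypothetical request type the API client would accept.
struct ShowMachineRequest {
    machine_id: String,
}

impl TryFrom<ShowMachineArgs> for ShowMachineRequest {
    type Error = String;

    fn try_from(args: ShowMachineArgs) -> Result<Self, Self::Error> {
        // Validation that used to live in the command body moves into
        // the conversion, so the command is just convert-and-call.
        if args.machine_id.is_empty() {
            return Err("machine id is required".to_string());
        }
        Ok(ShowMachineRequest {
            machine_id: args.machine_id,
        })
    }
}

fn main() {
    let args = ShowMachineArgs {
        machine_id: "machine-1".to_string(),
    };
    let req = ShowMachineRequest::try_from(args).expect("valid args");
    assert_eq!(req.machine_id, "machine-1");

    let bad = ShowMachineArgs {
        machine_id: String::new(),
    };
    assert!(ShowMachineRequest::try_from(bad).is_err());
}
```

With this in place, each command body shrinks to the convert-and-call shape shown above, which is what makes further templatizing possible.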

Signed-off-by: Chet Nichols III <chetn@nvidia.com>

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [x] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description

This new architecture document describes how various network
partitioning technologies (DPUs, IB and NVLink) are integrated into
NICo. It also acts as a guideline that future integrations should
follow.

## Type of Change

- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [x] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)

## Breaking Changes
- [ ] This PR contains breaking changes

## Testing

- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [x] No testing required (docs, internal refactor, etc.)

## Additional Notes

Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description

Add mock hardware support for NVIDIA DGX H100 systems including:
- 8x H100 80GB HBM3 GPUs with HGX chassis
- 2x ConnectX-7B quad-port InfiniBand NICs
- 1x ConnectX-7A dual-port storage NIC
- 1x Intel E810 dual-port storage NIC
- 1x Intel X550 management NIC
- 1x BlueField-3 DPU (fixed count of 1)

New NIC hardware modules:
- nic_intel_e810: Intel E810 dual-port NIC
- nic_intel_x550: Intel X550 NIC
- nic_nvidia_cx7: ConnectX-7 variants (CX7A dual-port, CX7B quad-port)

Additional changes:
- Add AMI BMC vendor support with /SD settings path
- Make manager eth_interfaces and firmware_version optional
- Make NIC serial_number optional for Intel NICs without serials
- Add fixed_number_of_dpu() for platforms with fixed DPU count
- Add ok_no_content() helper for PATCH responses returning 204
- Add bmc_redfish_version() per hardware type

## Type of Change
- [x] **Add** - New feature or capability
- [x] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [x] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)

## Breaking Changes
- [ ] This PR contains breaking changes

## Testing
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [x] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes

Signed-off-by: Dmitry Porokh <dporokh@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description
Update the helm/.github CODEOWNERS.

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [x] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description

Add handling for force-delete and link the doc in the style guide

## Type of Change

- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)

## Breaking Changes
- [ ] This PR contains breaking changes

## Testing

- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes

Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description
Currently WorkLockManager is explicitly cancelled with the toplevel
CancellationToken. But handles to it would still exist in things like
the state controllers, which might need to finish their current
iteration before cancelling.

So instead of explicitly cancelling WorkLockManager when the toplevel
cancellation signal is received, let it cancel only after all handles
are dropped (which they should be once all the dependents are done and
drop their handles.)

With this approach, the cancel signal will explicitly cancel the API
listener and all the background controllers, which will drop the
`Arc<Api>` handle, which will free the last handle to WorkLockManager,
which will shut it down. The toplevel JoinSet still requires all tasks
to be complete, so we still rely on all of this happening before the API
server actually shuts down.

(Currently only integration tests use the explicit shutdown signal, but
adding a proper "graceful shutdown on SIGINT" handler can be done
trivially as a followup.)
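
The drop-based shutdown above can be sketched with a small, single-threaded example. `Manager` and `ManagerHandle` are hypothetical stand-ins for WorkLockManager and its handles, not the actual types:

```rust
use std::sync::{Arc, Mutex};

// Stand-in for WorkLockManager: shuts down only when the last handle drops.
struct Manager {
    state: Arc<Mutex<bool>>, // true once shutdown has happened
}

// Stand-in for a dependent's handle (API listener, state controller, ...).
struct ManagerHandle {
    state: Arc<Mutex<bool>>,
}

impl Drop for ManagerHandle {
    fn drop(&mut self) {
        // strong_count == 2 here means only this still-alive handle and
        // the manager itself share the state: we are the last dependent,
        // so trigger the shutdown.
        if Arc::strong_count(&self.state) == 2 {
            *self.state.lock().unwrap() = true;
        }
    }
}

impl Manager {
    fn new() -> Self {
        Manager { state: Arc::new(Mutex::new(false)) }
    }

    fn handle(&self) -> ManagerHandle {
        ManagerHandle { state: Arc::clone(&self.state) }
    }

    fn is_shut_down(&self) -> bool {
        *self.state.lock().unwrap()
    }
}

fn main() {
    let mgr = Manager::new();
    let api_listener = mgr.handle();
    let controller = mgr.handle();

    drop(api_listener);
    assert!(!mgr.is_shut_down()); // a dependent still holds a handle

    drop(controller);
    assert!(mgr.is_shut_down()); // last handle dropped -> shutdown
}
```

The real implementation would do this asynchronously, but the ordering guarantee is the same: cancellation of the dependents releases their handles, and only then does the manager shut down.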

## Type of Change
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [X] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)

## Breaking Changes
- [ ] This PR contains breaking changes

## Testing
- [x] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
See discussion in NVIDIA#586 for details.

Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
Fix typo in DCO section

Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description
Removes the old `DpuSsh` API, and adds a generic `BmcCredentials` API.

For now it only supports the `UsernamePassword` credentials type, but in
the future it should support `SessionTokens` as well.

## Type of Change
<!-- Check one that best describes this PR -->
- [x] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [x] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
NVIDIA#460

## Breaking Changes
- [x] This PR contains breaking changes

Removes old DpuSSH credentials API.

## Testing
<!-- How was this tested? Check all that apply -->
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [x] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: ianisimov <ianisimov@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
Continuation of
NVIDIA#628. Wanted to
do a bit first, and then do some more here, mainly since it's a lot to
look at.

TLDR is that we have a pattern where `Args` can just `.into()` (or
`.try_into()?`) the underlying `Request` that they exist for. These next
refactorings are also pretty straightforward, but I had punted them to a
separate PR because some of them weren't AS straightforward as the
original ones.

At the end of the day, a "command" is now something like..

```rust
let req = args.try_into()?;
let resp = api_client.0.call(req).await?;
// ..do something w/resp
```

..or in many cases (thanks @poroh for the callout here)..
```rust
api_client.0.call(args).await?;
```

...and probably allows us to get even deeper into how we templatize
things in the admin CLI.

## Description
<!-- Describe what this PR does -->

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [x] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description

This adds an introductory DNS section to the admin UI that includes:
- Zones we are authoritative for.
- The records we currently serve (including record information).

There will be some iterative improvements here, but I want to get
something out for people to work with and look at and go "oh we should
do X and Y instead."

Some improvements would probably be:
- Pagination (either server or client side).
- Filtering (either server or client side).
- Better integration with the global search box (if it doesn't exist
yet).
- Etc.

Signed-off-by: Chet Nichols III <chetn@nvidia.com>

## Type of Change
<!-- Check one that best describes this PR -->
- [x] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description

This introduces an Overlay Networks section to the IPAM section in the
admin UI.

The drilldown is:
- `/admin/ipam/overlay`
  - `/admin/ipam/overlay/prefix/{prefix-id}`
    - `/admin/ipam/overlay/segment/{segment-id}`

When you go to the main **Overlay Networks** page, you get a table with:
- Name
- VNI
- Prefixes

You can then click on one of the prefixes for that VNI, which brings you
to a **Prefix** view showing:
- Prefix
- Name
- Gateway
- Allocated IPs
- Segments (prefix)

You can THEN click on one of the segments in that VNI prefix, which
brings you to a **Segment** view showing:
- Segment
- Parent Prefix
- Table of allocated IPs (and the current instance it's allocated to)

This **ALSO** adds the **Underlay** section, which I was going to keep
separate, and then just decided to make it all one PR.

The drilldown is:
- `/admin/ipam/underlay`
  - `/admin/ipam/underlay/segment/{segment-id}`

I was hoping to reuse the segment view, but since the data is different,
it's two separate ones right now.

When you go to the main **Underlay Networks** view, you get a table
with:
- Name
- Prefix
- Type (admin, underlay, host_inband)
- Gateway
- Allocated IPs

And then when you drill down to the segment, you get the parent prefix
and a table of allocated:
- IP
- MAC
- Machine

Signed-off-by: Chet Nichols III <chetn@nvidia.com>

## Type of Change
<!-- Check one that best describes this PR -->
- [x] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description

I don't know how this actually snuck through -- build + tests passed?
All said, it needs to go! This was the old placeholder for the IPAM
admin UI section. Now that all of the sub-sections have been filled in,
this isn't in use, and is causing dead-code errors. Removing
both the struct and its associated HTML file.

Signed-off-by: Chet Nichols III <chetn@nvidia.com>

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [x] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
…IA#656)

Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
…IA#656)

Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
…#654)

Signed-off-by: Andrew Forgue <aforgue@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
…o 90 minutes (NVIDIA#650)

## Description
This state requires the following operations:

- power-cycle the host
- CheckHostConfig
- ConfigureBios & PollingBiosSetup
- SetBootOrder

Power-cycling the host and SetBootOrder are sometimes time-consuming,
depending on the machine configuration. Some machines can take 20
minutes for the first SetBootOrder attempt.

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [x] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [x] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description

This was discussed at length internally amongst
[DSX](https://nvidianews.nvidia.com/news/nvidia-releases-vera-rubin-dsx-ai-factory-reference-design-and-omniverse-dsx-digital-twin-blueprint-with-broad-industry-support)
and [NCX](https://docs.nvidia.com/ncx/index.html) teams.

While historically [what is now known as] NICo has always generated its
own internal ID for *components*, a rack ID is kind of a grey area
between a component and an identifier. You can think of a rack as a
supercomputer, or something akin to a blade server, BUT, you can also
think of it as a place where components are stored.

With that, we decided the `RackId` should be a `String` whose SoT comes
from the DCIM (Datacenter Inventory Manager) system, since that is how
the BMS (Building Management System) identifies racks, and events for
racks, including things like leak detection.

Instead of NICo generating its own stable `RackId` and maintaining
a mapping between its internal rack ID and the DCIM rack ID, it was
decided the `RackId` should just come from the DCIM as part of "expected
component" ingestion: `ExpectedRack`, `ExpectedMachine`,
`ExpectedPowerShelf`, and `ExpectedSwitch` entities will all be enqueued
with the `RackId` from the DCIM. This ultimately allows for the DCIM
rack ID to be the one and only SoT, and allows for all components in a
DSX AI Factory to agree without confusion from alternative IDs.

So, strip away the hardware-backed `RackId` plans, and move towards a
newtype over `String`, allowing the DCIM to provide whatever it wants.

This change is backwards compatible:
- The database already stores it as text, not a `uuid`, so we don't have to
worry about conversion issues.
- We are moving from an "encoded" value to an open `String` value, so
even pre-existing encoded `RackIds` work with the `String`-backed
newtype.
- The gRPC `common.RackId` is still the same. It was a `String` to begin
with.
- The JSON serialization of it is still the same. We use
`#[serde(transparent)]`, so it's still just `"rack_id":
"whatever-id-you-want"`.

The downside is that now that it's a `String`, it can't be `Copy`, so we
have to `.clone()` in certain cases. I pass it around by reference as
much as I can, but not everywhere.

Also added some tests to ensure:
- The old format still works.
- New strings work.
- Empty strings don't work.
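
A minimal sketch of a `String`-backed newtype with those properties (the names and validation here are illustrative, not the actual NICo definition, and the `#[serde(transparent)]` attribute is omitted to keep the sketch dependency-free):

```rust
// Hypothetical String-backed RackId newtype: the DCIM-provided ID is
// accepted as-is, and only empty strings are rejected.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct RackId(String);

impl TryFrom<String> for RackId {
    type Error = &'static str;

    fn try_from(s: String) -> Result<Self, Self::Error> {
        if s.is_empty() {
            Err("rack id must not be empty")
        } else {
            Ok(RackId(s))
        }
    }
}

impl std::fmt::Display for RackId {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        f.write_str(&self.0)
    }
}

fn main() {
    // Pre-existing encoded values and new free-form DCIM strings both pass.
    assert!(RackId::try_from("rack-0001".to_string()).is_ok());
    assert!(RackId::try_from("DC1/ROW2/RACK7".to_string()).is_ok());

    // Empty strings are rejected.
    assert!(RackId::try_from(String::new()).is_err());
}
```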

Signed-off-by: Chet Nichols III <chetn@nvidia.com>

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [x] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [x] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
…IA#661)

## Description
machine-a-tron now supports a handful of different hardware types, so it
is useful to see the hardware type of a specific machine in the TUI.

## Type of Change
- [x] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [x] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)

## Breaking Changes
- [ ] This PR contains breaking changes

## Testing
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [x] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes

Signed-off-by: Dmitry Porokh <dporokh@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description

Don't need these. Noticed them when doing some other work.

Longer technical explanation for those who don't know is that
`.to_string()` takes `&self`, and creates a new `String` from it, so
we're making a `.clone()` of something that would just be taken as a
reference anyway.
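
A minimal illustration of the redundancy (variable names are made up for the example):

```rust
fn main() {
    let id = String::from("machine-42");

    // Redundant: `to_string` takes `&self` and allocates a new String,
    // so the explicit `.clone()` just allocates a throwaway copy first.
    let a = id.clone().to_string();

    // Equivalent result, one allocation fewer:
    let b = id.to_string();

    assert_eq!(a, b);
}
```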

Signed-off-by: Chet Nichols III <chetn@nvidia.com>

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [x] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->

Signed-off-by: Chet Nichols III <chetn@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
…A#620)

## Description
Fixes a gap where extension services marked for removal were not cleaned
up from instance config even after all DPUs reported successful
termination.

Previously, terminated extension service cleanup was only executed in
the `WaitingForExtensionServicesConfig` instance state; however,
extension service config updates do not transition the machine out of
Ready (only tenant state moves to `Configuring`). This change adds
instance extension service config cleanup in the `Ready` state so
terminated services can be cleaned up.

## Type of Change
<!-- Check one that best describes this PR -->
- [ ] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [x] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)

## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->

## Breaking Changes
- [ ] This PR contains breaking changes

<!-- If checked above, describe the breaking changes and migration steps
-->

## Testing
<!-- How was this tested? Check all that apply -->
- [x] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)

Signed-off-by: Felicity Xu <hanyux@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
@Matthias247
Contributor

something is broken with this PR. It contains a lot of changes. Probably rebase gone wrong.

@srinivasadmurthy
Contributor Author

Yes. I had not signed the commits, and followed GitHub's recommendations to sign them (which included a rebase). That seems to have messed up the PR. I am going to create a new branch, open a new PR, and discard this one.
